
Fastcdc #11

Closed · wants to merge 5 commits
Conversation

@dpc (Collaborator) commented Jul 30, 2017

No description provided.

@dpc (Collaborator, author) commented Jul 30, 2017

This is on top of #10.

Still a WIP; I need to confirm that it actually works as expected, etc.

@dpc (Collaborator, author) commented Jul 30, 2017

@RAOF You've mentioned that you're planning to work on FastCDC. Currently I'm not even sure whether I understood the whitepaper correctly, but here it is, in case you're interested.

@dpc force-pushed the fastcdc branch 3 times, most recently from 51a067c to 4b9f37e on August 1, 2017
@aidanhs (Owner) commented Aug 2, 2017

@dpc (Collaborator, author) commented Aug 2, 2017

Oh, nice! There was so much stuff that I was unclear about, and I think this one might actually be right.

It doesn't look to me like it is going to be fast, though. The API enforces copying, and the nested conditions inside the loop (https://github.com/dswd/zvault/blob/master/chunking/src/fastcdc.rs#L103) are probably not going to get optimized away. As far as I understand, the biggest chunk of the speedup was being able to skip rolling through the initial bytes; Normalized Chunking was supposed to counteract the resulting deduplication-ratio loss without affecting the speedup.
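A minimal sketch of that structure as I read it (all constants, masks, and the table handling are placeholders, not the code in this PR or in zvault): skip the first min_size bytes outright, then use a stricter mask before the average size and a looser one after it.

use std::cmp;

// Placeholder constants; the paper's actual parameters and masks differ.
const MIN_SIZE: usize = 2 * 1024;
const AVG_SIZE: usize = 8 * 1024;
const MAX_SIZE: usize = 64 * 1024;
const MASK_STRICT: u64 = 0x0000_d930_0353_0000; // more one-bits: cut points are rarer
const MASK_LOOSE: u64 = 0x0000_d900_0003_0000;  // fewer one-bits: cut points are more likely

// Sketch: skip the first MIN_SIZE bytes entirely, then roll a Gear-style digest,
// switching from the strict mask to the loose one once AVG_SIZE is passed.
fn find_cut_point(buf: &[u8], table: &[u64; 256]) -> usize {
    let mut digest = 0u64;
    let end = cmp::min(buf.len(), MAX_SIZE);
    let mut i = cmp::min(MIN_SIZE, end); // the skipped bytes are the main speedup
    while i < end {
        digest = (digest << 1).wrapping_add(table[buf[i] as usize]);
        let mask = if i < AVG_SIZE { MASK_STRICT } else { MASK_LOOSE };
        if digest & mask == 0 {
            return i + 1;
        }
        i += 1;
    }
    end
}

fn main() {
    // Throwaway table just so the sketch runs; a real Gear table is 256 random u64s.
    let mut table = [0u64; 256];
    let mut s = 1u64;
    for e in table.iter_mut() {
        s = s.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        *e = s;
    }
    let data = vec![0xABu8; 32 * 1024];
    println!("cut at {}", find_cut_point(&data, &table));
}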

At the very minimum, I should write a test that compares the two implementations, and then beat the code into submission until both versions yield the same results. :)

@dswd I need more time to review, but in the meantime: https://github.com/dswd/zvault/blob/master/chunking/src/fastcdc.rs#L24 - what is the deal with these masks? In the Ddelta paper, Gear was described as just using the most significant bits, so why can't that be used in FastCDC?

What do you think about statically including the lookup table? It's just 2K.

@dswd commented Aug 2, 2017

Hi, I reviewed my code and implemented some improvements:

  • I added some more tests to verify that changes to my code still result in the same chunk sizes.
  • I moved most conditions out of the loop, similar to the code in this PR.

On the masks: the paper describes the masks for an 8k average chunk size and mentions that spreading the 1 bits out in the masks yields slightly better deduplication. So that is what I am doing in that method: I am using a simple PRNG, seeded with the given seed, to set the correct number of random bits in the masks. The major thing here is that the masks depend on the seed; otherwise you could just use a static one.
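A hedged sketch of that seeding scheme (not dswd's actual code; splitmix64 and the bit count are arbitrary choices here):

// splitmix64: an arbitrary small PRNG picked for this sketch; any deterministic generator works.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

// Set `one_bits` bits at pseudo-random positions of a 64-bit mask, derived
// deterministically from `seed`, so the mask changes with the chunker's seed.
fn seeded_mask(one_bits: u32, seed: u64) -> u64 {
    let mut state = seed;
    let mut mask = 0u64;
    while mask.count_ones() < one_bits {
        mask |= 1u64 << (splitmix64(&mut state) % 64);
    }
    mask
}

fn main() {
    // 13 one-bits roughly corresponds to an 8k average; the exact per-stage
    // counts are an assumption here.
    println!("{:064b}", seeded_mask(13, 42));
}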

On the static lookup table: I can't do this, as my table is seeded, but I don't see a problem with a static table if you do not need this feature. However, the effort of computing the table is negligible, and I doubt a static one would be any faster to access during chunking.

On the performance: on my laptop, my optimized code reaches over 1000 MB/s, while the code in this PR stays below 500 MB/s (my previous, unoptimized code had the same speed). I don't know exactly why, but I suspect that my dummy sink allows the compiler to omit the costly memcpy calls. Surely this is not the most elegant way to get the positions, but it seems fast enough. I have no clue why your code is so much slower; it should be at least as fast as mine.
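For illustration, a minimal sketch of a dummy sink in that sense (an assumption about the shape, not the zvault code): a Write implementation that counts bytes but never copies them, which the optimizer can often reduce to almost nothing.

use std::io::{self, Write};

// A Write that records how many bytes it was offered but never copies them.
struct CountingSink {
    written: usize,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.written += buf.len();
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() {
    let mut sink = CountingSink { written: 0 };
    sink.write_all(b"some chunk data").unwrap();
    println!("{} bytes \"written\"", sink.written);
}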

@dpc (Collaborator, author) commented Aug 2, 2017

Very interesting! We should definitely investigate how far we can push our two versions. :)

I glanced quickly at https://github.com/dswd/zvault/blob/master/chunking/benches/all.rs#L107, and I am suspicious that DevNull allows the compiler to optimize everything away in a way it couldn't in real code.

Similarly, there is such a risk in my benchmarks, but since it is not that much faster, I haven't thought about it.

At the bare minimum, we should pass the results through https://doc.rust-lang.org/1.1.0/test/fn.black_box.html to prevent (somewhat, hopefully) unfair optimizations.
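A minimal sketch of what that could look like with the nightly test crate (e.g. in a benches/ file); chunk_positions is a placeholder for whichever chunker is actually being measured:

#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Placeholder "chunker": not a real CDC algorithm, just something measurable.
fn chunk_positions(buf: &[u8]) -> Vec<usize> {
    buf.iter()
        .enumerate()
        .filter(|&(_, &b)| (b & 0x3F) == 0)
        .map(|(i, _)| i)
        .collect()
}

#[bench]
fn perf_1mb(b: &mut Bencher) {
    let data = vec![0x5Au8; 1024 * 1024];
    b.bytes = data.len() as u64;
    b.iter(|| {
        // black_box on both the input and the result keeps the compiler from
        // proving the work unused and deleting it.
        black_box(chunk_positions(black_box(&data)))
    });
}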

@RAOF left a review

Some nits, and a suggestion for later refactoring.

if self.current_chunk_size < avg_size {
    let roll_bytes = cmp::min(avg_size - self.current_chunk_size,
                              buf.len() as u64);
    let result = self.gear.find_chunk_edge_cond(buf, |e: &Gear| (e.digest() >> min_shift) == 0);
@RAOF:

This is missing the padding-zeros optimisation for deduplication efficiency (likewise the large-chunk calculation)

@dpc (Collaborator, author):

I don't get the padding zeros. Since the underlying Gear (as described in the Ddelta paper) is taking the most significant bits, isn't that the ultimate padding? I was so confused by this.

@RAOF:

Hm, you're quite right. I've misread this; the window here is essentially the width of the digest: 64 bytes.

The authors of the paper describe the mask they use in the algorithm as empirically derived, but infuriatingly give no details about it. You'd think that taking the largest window would be best, but apparently not? Apparently it works best when the contributing bits are spread approximately uniformly across the 64-bit digest?
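To make the two options concrete, a tiny sketch with illustrative constants (not the paper's actual masks): both tests look at 13 digest bits, they just pick different positions.

fn cut_top_bits(digest: u64) -> bool {
    // Gear-style test as in this PR: the 13 most significant bits must be zero.
    (digest >> (64 - 13)) == 0
}

fn cut_spread_bits(digest: u64) -> bool {
    // FastCDC-style test: 13 one-bits spread across the word, all of which
    // must be zero in the digest.
    const MASK: u64 = 0x0000_d930_0353_0000; // 13 bits set
    digest & MASK == 0
}

fn main() {
    let d = 0xdead_beef_dead_beefu64;
    println!("{} {}", cut_top_bits(d), cut_spread_bits(d));
}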

@dpc (Collaborator, author) commented Aug 3, 2017:

> it works best when the contributing bits are spread approximately uniformly across the 64-bit digest?

But why? :D I am not a very academically minded person, but I found this paper rather confusing in many places: a lot of repeating the obvious, and glossing over the important details.

Well, @dswd has the correct implementation here (https://github.com/dswd/zvault/blob/master/chunking/src/fastcdc.rs), so we can just use it. :)


let mut cur_offset = 0usize;

loop {
@RAOF:

What is this loop for? As far as I can tell it will loop exactly once, because all codepaths return?



impl Engine for FastCDC {
    type Digest = u64;
@RAOF:

This really emphasises that trait Engine is an incorrect abstraction for CDC. FastCDC doesn't actually have a Digest, nor can you roll a single byte.

(Similarly, AE and MAXP can't even pretend to have a digest, because they're not even approximately hash-based)
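For illustration only, a hypothetical digest-free interface (not an API that exists in this crate) could be as small as:

// Hypothetical sketch: a chunker interface that exposes nothing but chunk
// edges, so digest-less algorithms like AE or MAXP could also implement it.
trait ChunkEdgeFinder {
    /// Return the offset just past the next chunk edge found in `buf`, if any,
    /// carrying whatever internal state is needed across calls.
    fn next_edge(&mut self, buf: &[u8]) -> Option<usize>;
}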

@dpc (Collaborator, author):

True. However, Engine is all we have ATM. :)

@aidanhs (Owner):

Interesting point. I had only been looking at rolling checksums (hence the crate name) and hadn't really thought about non-checksum-based alternatives for chunking.

@RAOF commented Aug 3, 2017

> But why? :D I am not a very academically minded person, but I found this paper rather confusing in many places: a lot of repeating the obvious, and glossing over the important details.

That's an excellent question! This appears to be a recurring problem in CDC papers I've read.

It would have been really nice to see their derivation of their masks - what did they try? What was the range of deduplication efficiencies? How stable is their choice - is there a dataset upon which it is really good, and others on which it's mediocre, or is it reliably good for all datasets they tried? 🤷‍♀️

@dswd commented Aug 3, 2017

My new code is using black_box, which does not seem to change the performance. I also changed my dummy sink to calculate the split positions, to be comparable to your code.

Speaking of the curious masks, I am also very unsure why they want to distribute the bits uniformly. It seems they do not really want to have a 64-byte window.


// ignore bytes that are not going to influence the digest
if self.current_chunk_size < ignore_size {
    let skip_bytes = cmp::min(ignore_size - self.current_chunk_size, buf.len() as u64);
@dpc (Collaborator, author) commented Aug 3, 2017:

Doing this min outside of the condition might perform better, like here: https://github.com/dswd/zvault/blob/master/chunking/src/fastcdc.rs#L99
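A tiny standalone sketch of that suggestion (hypothetical free function, not the code under review): saturating_sub makes the clamp unconditional, so it already yields 0 once current_chunk_size has passed ignore_size.

use std::cmp;

// Clamp without branching on the comparison first.
fn bytes_to_skip(current_chunk_size: u64, ignore_size: u64, buf_len: u64) -> u64 {
    cmp::min(ignore_size.saturating_sub(current_chunk_size), buf_len)
}

fn main() {
    assert_eq!(bytes_to_skip(0, 2048, 4096), 2048);
    assert_eq!(bytes_to_skip(3000, 2048, 4096), 0); // already past ignore_size
}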

@dpc (Collaborator, author) commented Aug 3, 2017

@dswd I will investigate your code and benchmarks more at some point, but as it is right now I actually like your code somewhat better anyway. :D

The one thing that I don't like is these Read and Write arguments. I am still scratching my head over how your code can be faster with that copying there. Probably it gets totally optimized away, but in real code it might not. In rdedup I will already have the data in a Vec in SGData or something like that, and I want to use the chunks in place, without copying them anywhere.

Also, in my implementation I do roll through the 64 bytes before min_size, as they do affect the checksum for the first bytes after min_size. I doubt it matters in practice, so maybe it could be ignored for performance. Dunno.

In my version I use 8 for both min_size and max_size, which leaves your code with a bigger min_size, so there's more to skip... Probably not going to affect the performance that much.

OK. Will get back to it at some point. :)

@dswd commented Aug 3, 2017

The main advantage of the Read and Write types is that the copy gets optimized away if the Write type ignores the actual data (at least it seems so), and if you really do need to copy the data somewhere, you can copy it directly to where it should go (by passing the destination as the Write) instead of into some intermediate buffer.

I thought about including the data before min_size too, but in the end those bytes do not matter. With the chunk sizes we are talking about (8k+), the chances of missing a cut point in the first bytes after min_size because of the 64 bytes before it are negligible.

In my experiments, skipping the min_size bytes did not give that much of a boost. Theoretically it should give up to 25%, as I am skipping that much of the input data, but for me it only gives about 15%. Maybe this is because those bytes would only be hashed, without the hash being compared against any mask. So if you are skipping 12.5% instead, your maximal speed benefit can be about 12.5%, but I would guess that it is below 10%.

@aidanhs (Owner) commented Aug 3, 2017

Just a minor note - please don't copy code directly from @dswd as our licenses are not compatible (assuming the chunking code has the same license as zvault itself) :)

@dswd commented Aug 3, 2017

The chunking code, especially the fastcdc part, is not that much code, so I don't object to MIT/BSD licenses.

I hereby grant permission to use my chunking code (https://github.com/dswd/zvault/tree/master/chunking) under the terms of either the MIT license or BSD license.

If you plan a combined crate with lots of chunking algorithms, I am also happy to move my code there and use that crate as a dependency, as long as it provides an efficient interface for my use case.

@dpc (Collaborator, author) commented Aug 4, 2017

Below are the inner loops from objdump -D dumps.

Mine:

   107a5:       49 8b 14 24             mov    (%r12),%rdx
   107a9:       48 01 d2                add    %rdx,%rdx
   107ac:       0f b6 30                movzbl (%rax),%esi
   107af:       48 ff c0                inc    %rax
   107b2:       48 03 14 f7             add    (%rdi,%rsi,8),%rdx
   107b6:       49 89 14 24             mov    %rdx,(%r12)
   107ba:       48 ff c3                inc    %rbx
   107bd:       49 85 d6                test   %rdx,%r14
   107c0:       75 de                   jne    107a0 <_ZN7rollsum7fastcdc7FastCDC15find_chunk_edge17hca396ceab3902345E+0x1c0>

@dswd's:

    f292:       4d 01 e4                add    %r12,%r12
    f295:       0f b6 84 1d 90 d7 ff    movzbl -0x2870(%rbp,%rbx,1),%eax
    f29c:       ff 
    f29d:       48 ff c3                inc    %rbx
    f2a0:       4c 03 a4 c5 68 cf ff    add    -0x3098(%rbp,%rax,8),%r12
    f2a7:       ff 
    f2a8:       4d 85 c4                test   %r8,%r12
    f2ab:       75 d3                   jne    f280 <_ZN4test13ns_iter_inner17h7be6bdb8f088759eE+0x5b0>

In my code, the compiler fails to keep the digest directly in a register, and keeps reading it from and writing it back through a pointer, (%r12). Since that's two needless additional memory accesses versus just the one that's necessary, it is plausible that this causes the difference between the 700 MB/s and 2100 MB/s that I see on my box.

There is also an additional inc instruction, probably from the enumerate() in find_chunk_edge_cond. I think find_chunk_edge_cond is also the reason why the compiler fails to optimize better - and since I use it in the benchmarks for gear and bup too, they are all affected, so the benchmarks show relatively small differences between them (still measurable, but smaller than expected).

find_chunk_edge_cond is a crappy method in the first place, and I am ashamed that I've invented it, IIRC. :D

@dpc (Collaborator, author) commented Aug 4, 2017

The good part is that 2100 MB/s seems doable (I was afraid it was just a misunderstanding of some kind), so it will push the bottleneck that I often see in rdedup quite far away.

@dpc (Collaborator, author) commented Aug 4, 2017

find_chunk_edge_cond is not the problem. It seems like the digest field in the struct/trait is; the compiler just insists on updating it in place...
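A hedged sketch of the kind of workaround being discussed, not the actual commit: copy the field into a local before the hot loop so it can live in a register, and write it back once afterwards. The dummy table is only there so the sketch runs.

struct Gear {
    digest: u64,
}

impl Gear {
    fn find_chunk_edge(&mut self, buf: &[u8], table: &[u64; 256], mask: u64) -> Option<usize> {
        let mut digest = self.digest; // local copy: free to live in a register
        for (i, &b) in buf.iter().enumerate() {
            digest = (digest << 1).wrapping_add(table[b as usize]);
            if digest & mask == 0 {
                self.digest = 0; // reset for the next chunk
                return Some(i + 1);
            }
        }
        self.digest = digest; // write the field back once, outside the hot loop
        None
    }
}

fn main() {
    let table = [1u64; 256]; // dummy table
    let mut g = Gear { digest: 0 };
    println!("{:?}", g.find_chunk_edge(&[0u8; 1024], &table, (1 << 13) - 1));
}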

@dpc mentioned this pull request on Aug 5, 2017
@dpc (Collaborator, author) commented Aug 5, 2017

Since I've failed to find a way to get the compiler to optimize this properly, I've filed rust-lang/rust#43653.

@dpc (Collaborator, author) commented Aug 6, 2017

The fix is relatively simple, at least for gear.

89010c1

Before:

test gear::tests::bench::perf_1mb ... bench:   1,421,243 ns/iter (+/- 4,696) = 737 MB/s

After:

test gear::tests::bench::perf_1mb ... bench:     544,688 ns/iter (+/- 28,115) = 1925 MB/s

Comparing to fastcdc by @dswd

bench:     478,465 ns/iter (+/- 28,868) = 2191 MB/s               

The difference, I'm guessing, is that fastcdc is generally faster because it's skipping some data, which gear is not doing. I'll redo fastcdc once we settle on the API and #12 in general. Then we can measure again and decide which impl to take.

@dpc (Collaborator, author) commented Aug 6, 2017

Reminder to self: there is one cmp + test pair in both implementations that I don't see a reason for. Could be optimizable.

@dpc (Collaborator, author) commented Aug 6, 2017

I have reworked and optimized things a bit more in #14, reaching a speed almost as good as @dswd's code:

test fastcdc::tests::bench::perf_1mb_008k_chunks ... bench:     516,674 ns/iter (+/- 22,610) = 2029 MB/s

However, I've filed dswd/zvault#10 to show the cost of an API that involves temporary copying.

@dpc (Collaborator, author) commented Aug 6, 2017

Closing this one; we can continue in #14.

@dpc closed this on Aug 6, 2017